2025-05-22-12-06
COSMIC: Enabling Full-Stack Co-Design and Optimization of Distributed Machine Learning Systems
Abstract
arXiv:2505.15020v1 Announce Type: new Abstract: Large-scale machine learning models necessitate distributed systems, posing significant design challenges due to the large parameter space across distinct design stacks. Existing studies often focus on optimizing individual system aspects in isolation. This work challenges this limitation and introduces COSMIC, a full-stack distributed machine learning systems environment enabling end-to-end simulation and agent-based design space exploration. To facilitate efficient exploration and optimization across the entire stack, we introduce the Parameter Set Architecture, an abstraction concept analogous to the instruction set architecture that abstracts away the configuration complexities of agent-based search methods. Case studies demonstrate COSMIC's ability to consolidate parameters across multiple layers of design abstraction, discovering eight non-obvious high-performance system configurations across four transformer-based models with up to 175 billion parameters. By optimizing across the stack, COSMIC's full-stack optimization delivers 1.50-48.41x higher performance than isolated single-stack optimization.
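Summary
Large-scale machine learning models require distributed systems, and the vast parameter space spanning different design stacks makes those systems hard to design. Existing studies tend to optimize individual aspects of the system in isolation. This work breaks through that limitation and proposes COSMIC, a full-stack distributed machine learning systems environment that supports end-to-end simulation and agent-based design space exploration. To enable efficient exploration and optimization across the whole stack, we propose the Parameter Set Architecture, an abstraction that plays a role analogous to the instruction set architecture and removes the configuration complexity of agent-based search methods. Case studies show that COSMIC can consolidate parameters across multiple levels of design abstraction, discovering eight non-obvious high-performance system configurations across four Transformer-based models with up to 175 billion parameters. Through full-stack optimization, COSMIC delivers a 1.50-48.41x performance improvement over isolated single-stack optimization.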
Balanced and Elastic End-to-end Training of Dynamic LLMs
Abstract
arXiv:2505.14864v1 Announce Type: new Abstract: To reduce computational and memory costs in Large Language Models (LLMs), dynamic workload reduction schemes like Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs) have emerged. However, these methods introduce severe workload imbalances, limiting their practicality for large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that ensures optimal compute distribution when using pipeline parallelism in training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks into fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared to static training methods (Megatron-LM, DeepSpeed), DynMo accelerates training by up to 1.23x (MoEs), 3.18x (pruning), 2.23x (layer freezing), 4.02x (sparse attention), 4.52x (early exit), and 1.17x (MoDs). DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.
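Summary
To reduce the compute and memory costs of large language models (LLMs), a variety of dynamic workload reduction schemes have been proposed, such as Mixture of Experts (MoEs), parameter pruning, layer freezing, sparse attention, early token exit, and Mixture of Depths (MoDs). These methods, however, cause severe load imbalance, which limits their practicality in large-scale distributed training. We propose DynMo, an autonomous dynamic load balancing solution that achieves optimal allocation of compute under pipeline parallelism when training dynamic models. DynMo adaptively balances workloads, dynamically packs tasks onto fewer workers to free idle resources, and supports both multi-GPU single-node and multi-node systems. Compared with static training methods (Megatron-LM, DeepSpeed), DynMo speeds up training by up to 1.23x for MoEs, 3.18x for pruning, 2.23x for layer freezing, 4.02x for sparse attention, 4.52x for early exit, and 1.17x for MoDs. DynMo is available at https://anonymous.4open.science/r/DynMo-4D04/.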
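The balancing problem that pipeline parallelism poses for dynamic models can be illustrated with a small example. The sketch below is not DynMo's algorithm; it is a generic linear-partition routine, assuming that per-layer costs are known integers and that each pipeline stage must hold a contiguous run of layers. The function names `stages_needed` and `balance_layers` are hypothetical.

```python
from typing import List


def stages_needed(costs: List[int], cap: int) -> int:
    """Count the contiguous stages needed so that no stage load exceeds `cap`."""
    stages, load = 1, 0
    for c in costs:
        if load + c > cap:
            stages += 1
            load = c
        else:
            load += c
    return stages


def balance_layers(costs: List[int], num_stages: int) -> List[List[int]]:
    """Split layers into at most `num_stages` contiguous groups,
    minimising the load of the heaviest group (binary search on the cap)."""
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(costs, mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1
    # Rebuild the assignment greedily under the smallest feasible cap `lo`.
    groups: List[List[int]] = []
    current: List[int] = []
    load = 0
    for i, c in enumerate(costs):
        if current and load + c > lo:
            groups.append(current)
            current, load = [], 0
        current.append(i)
        load += c
    groups.append(current)
    return groups


# Assumed scenario: the last four layers became cheap (e.g. after pruning).
costs = [4, 4, 1, 1, 1, 1]
three_stage_plan = balance_layers(costs, 3)
two_stage_plan = balance_layers(costs, 2)
```

Re-running the partition whenever layer costs change gives a rebalanced plan, and calling it with a smaller `num_stages` mimics packing the same work onto fewer workers.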
FOL-Pretrain: A complexity annotated corpus of first-order logic
Abstract
arXiv:2505.14932v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities, from coding and solving mathematical problems to commonsense inference. While these tasks vary in complexity, they all require models to integrate and compute over structured information. Despite recent efforts to reverse-engineer LLM behavior through controlled experiments, our understanding of how these models internalize and execute complex algorithms remains limited. Progress has largely been confined to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One barrier to deeper understanding is the nature of pretraining data -- vast, heterogeneous, and often poorly annotated, making it difficult to isolate mechanisms of reasoning. To bridge this gap, we introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, designed to probe and analyze algorithmic reasoning in LLMs. The dataset consists of 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Each synthetic example is verifiably correct, produced by a custom automated theorem solver, and accompanied by metadata tracing its algorithmic provenance. We aim to provide a scalable, interpretable artifact for studying how LLMs learn and generalize symbolic reasoning processes, paving the way for more transparent and targeted investigations into the algorithmic capabilities of modern models.
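Summary
Transformer-based large language models (LLMs) have shown remarkable reasoning abilities in areas ranging from programming and mathematical problem solving to commonsense reasoning. Although these tasks differ in complexity, all of them require the model to integrate and compute over structured information. Despite recent work that reverse-engineers LLM behavior through controlled experiments, our understanding of how these models internally implement complex algorithms is still limited. Progress so far has been confined mostly to small-scale studies or shallow tasks such as basic arithmetic and grammatical pattern matching. One obstacle to deeper understanding is the nature of pretraining data: vast, heterogeneous, and often poorly annotated, which makes it hard to isolate reasoning mechanisms. To bridge this gap, we present a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces, intended for probing and analyzing algorithmic reasoning in LLMs. The dataset contains 3.5 billion tokens, including 8.8 million LLM-augmented, human-annotated examples and 7.5 million synthetically generated examples. Every synthetic example is generated by a custom automated theorem prover, is verifiably correct, and comes with metadata tracing its algorithmic provenance. We hope this scalable, interpretable dataset will serve as a tool for studying how LLMs learn and generalize symbolic reasoning, opening a more transparent and targeted path for research on the algorithmic capabilities of modern models.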
Generalised Probabilistic Modelling and Improved Uncertainty Estimation in Comparative LLM-as-a-judge
Abstract
arXiv:2505.15240v1 Announce Type: new Abstract: This paper explores generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are specific cases of a broader framework, enabling diverse modelling options. Furthermore, we propose improved uncertainty estimates for individual comparisons, enabling more efficient selection and achieving strong performance with fewer evaluations. We also introduce a method for estimating overall ranking uncertainty. Finally, we demonstrate that combining absolute and comparative scoring improves performance. Experiments show that the specific expert model has a limited impact on final rankings, but our proposed uncertainty estimates, especially the probability of reordering, significantly improve the efficiency of systems, reducing the number of needed comparisons by ~50%. Furthermore, ranking-level uncertainty metrics can be used to identify low-performing predictions, where the nature of the probabilistic model has a notable impact on the quality of the overall uncertainty.
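Summary
This paper examines generalised probabilistic modelling and uncertainty estimation in comparative LLM-as-a-judge frameworks. We show that existing Product-of-Experts methods are special cases of a broader framework that supports a variety of modelling choices. We further propose improved uncertainty estimates for individual comparisons, which enable more efficient selection and strong performance with fewer evaluations. We also propose a new method for estimating the uncertainty of the overall ranking. Experiments demonstrate that combining absolute and comparative scoring improves system performance. Specifically, the expert model has limited impact on the final ranking, but the proposed uncertainty estimates, especially the probability of reordering, markedly improve system efficiency, cutting the number of comparisons required by about 50%. In addition, ranking-level uncertainty metrics can be used to identify low-quality predictions, where the nature of the probabilistic model has a notable effect on the quality of the overall uncertainty.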
When Can Large Reasoning Models Save Thinking? Mechanistic Analysis of Behavioral Divergence in Reasoning
Abstract
arXiv:2505.15276v1 Announce Type: new Abstract: Large reasoning models (LRMs) have significantly advanced performance on complex tasks, yet their tendency to overthink introduces inefficiencies. This study investigates the internal mechanisms of reinforcement learning (RL)-trained LRMs when prompted to save thinking, revealing three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through comprehensive analysis of confidence in thinking termination, attention from thinking to generation, and attentional focus on input sections, we uncover key factors influencing the reasoning behaviors. We further find that NT reduces output length at the cost of accuracy, while ET and IT maintain accuracy with reduced response length. Our findings expose fundamental inconsistencies in RL-optimized LRMs, necessitating adaptive improvements for reliable efficiency.
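Summary
Large reasoning models (LRMs) have delivered substantial gains on complex tasks, but their tendency to overthink makes them inefficient. This study examines the internal mechanisms of reinforcement learning (RL)-trained LRMs when they are asked to save thinking, and identifies three distinct thinking modes: no thinking (NT), explicit thinking (ET), and implicit thinking (IT). Through a comprehensive analysis of confidence in thinking termination, the shift of attention from thinking to generation, and the attentional focus on input sections, we identify key factors that shape reasoning behavior. We further find that NT shortens the output at the cost of accuracy, whereas ET and IT reduce response length while preserving accuracy. Our findings reveal fundamental inconsistencies in RL-optimized LRMs, which call for adaptive improvements to achieve reliable efficiency gains.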
ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Abstract
arXiv:2505.15068v1 Announce Type: new Abstract: Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect the complexity of real-world problems, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce ModelingBench, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also present ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate outputs, we further propose ModelingJudge, an expert-in-the-loop system leveraging LLMs as domain-specialized judges assessing solutions from multiple expert perspectives. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges.
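Summary
Recent progress in large language models (LLMs) has brought notable breakthroughs in solving mathematical problems. Existing benchmarks, however, often fail to reflect the complexity of real-world problems, which call for open-ended, interdisciplinary reasoning and the integration of computational tools. To fill this gap, we propose ModelingBench, a novel benchmark inspired by the real world that contains open-ended problems from mathematical modeling competitions in many domains, from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench also supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. We also propose ModelingAgent, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. To evaluate the outputs, we further propose ModelingJudge, an expert-in-the-loop system that uses LLMs as domain-specialized judges to assess solutions from multiple expert perspectives. Empirical results show that ModelingAgent significantly outperforms strong baselines and that its solutions are often hard to distinguish from those of human experts. Overall, our work provides a comprehensive framework for evaluating and advancing real-world problem solving in open-ended, interdisciplinary modeling challenges.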
Reinforcement Learning from User Feedback
Abstract
arXiv:2505.14946v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly deployed in diverse user facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback: user feedback is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate P[Love] into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that P[Love] is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using P[Love] significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
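Summary
As large language models (LLMs) become increasingly common in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods such as Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of ordinary users. We propose Reinforcement Learning from User Feedback (RLUF), a framework that aligns LLMs by directly using implicit signals from users in production. RLUF addresses the key challenges of user feedback: it is usually binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, P[Love], to predict the probability that an LLM response receives a "Love Reaction" (a lightweight form of positive user feedback), and integrate P[Love] into a multi-objective policy optimization framework together with helpfulness and safety objectives. Large-scale experiments show that P[Love] effectively predicts increases in positive feedback and can serve as a reliable offline evaluator of future user behavior. Policy optimization with P[Love] significantly raises the observed positive-feedback rate, including a 28% increase in Love Reactions in live A/B tests. However, optimizing for positive reactions raises reward hacking challenges and requires careful balancing of objectives. By directly using implicit signals from users, RLUF offers a viable path to aligning LLMs with real-world user preferences at scale.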
Self-Evolving Curriculum for LLM Reasoning
Abstract
arXiv:2505.14970v1 Announce Type: new Abstract: Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly enhancing their reasoning abilities in domains such as mathematics and code generation. A crucial factor influencing RL fine-tuning success is the training curriculum: the order in which training problems are presented. While random curricula serve as common baselines, they remain suboptimal; manually designed curricula often rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with the RL fine-tuning process. Our approach formulates curriculum selection as a non-stationary Multi-Armed Bandit problem, treating each problem category (e.g., difficulty level or problem type) as an individual arm. We leverage the absolute advantage from policy gradient methods as a proxy measure for immediate learning gain. At each training step, the curriculum policy selects categories to maximize this reward signal and is updated using the TD(0) method. Across three distinct reasoning domains: planning, inductive reasoning, and mathematics, our experiments demonstrate that SEC significantly improves models' reasoning capabilities, enabling better generalization to harder, out-of-distribution test problems. Additionally, our approach achieves better skill balance when fine-tuning simultaneously on multiple reasoning domains. These findings highlight SEC as a promising strategy for RL fine-tuning of LLMs.
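Summary
Reinforcement learning (RL) has proven effective for fine-tuning large language models (LLMs), significantly improving their reasoning abilities in domains such as mathematics and code generation. A key factor in the success of RL fine-tuning is the training curriculum, that is, the order in which training problems are presented. Random curricula are a common baseline but remain suboptimal; manually designed curricula usually rely heavily on heuristics, and online filtering methods can be computationally prohibitive. To address these limitations, we propose Self-Evolving Curriculum (SEC), an automatic curriculum learning method that learns a curriculum policy concurrently with RL fine-tuning. The method formulates curriculum selection as a non-stationary multi-armed bandit problem and treats each problem category (e.g., difficulty level or problem type) as a separate arm. We use the absolute advantage from policy gradient methods as a proxy for immediate learning gain. At each training step, the curriculum policy selects the categories that maximize this reward signal and is updated with the TD(0) method. In experiments across three different reasoning domains (planning, inductive reasoning, and mathematics), SEC significantly improves models' reasoning abilities and enables better generalization to harder, out-of-distribution test problems. In addition, the method achieves a better balance of skills when fine-tuning on several reasoning domains at once. These findings indicate that SEC is an effective strategy for RL fine-tuning of LLMs.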
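The bandit formulation described for SEC can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes softmax (Boltzmann) sampling over per-category values, which the abstract does not specify, and it uses the fact that with no successor state a TD(0) update reduces to moving the value towards the observed reward. The class name `CurriculumBandit` and its parameters are hypothetical.

```python
import math
import random
from typing import Dict, List


class CurriculumBandit:
    """One arm per problem category; values track recent learning gain."""

    def __init__(self, categories: List[str], lr: float = 0.5,
                 temperature: float = 1.0) -> None:
        self.q: Dict[str, float] = {c: 0.0 for c in categories}
        self.lr = lr
        self.temperature = temperature

    def probabilities(self) -> Dict[str, float]:
        """Softmax over arm values (assumed sampling rule)."""
        top = max(self.q.values())
        weights = {c: math.exp((v - top) / self.temperature)
                   for c, v in self.q.items()}
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    def sample(self, rng: random.Random) -> str:
        """Draw the category to train on next."""
        probs = self.probabilities()
        return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

    def update(self, category: str, advantages: List[float]) -> float:
        """Reward is the mean absolute advantage of the batch;
        the arm value moves towards it with step size `lr`."""
        reward = sum(abs(a) for a in advantages) / len(advantages)
        self.q[category] += self.lr * (reward - self.q[category])
        return reward


bandit = CurriculumBandit(["easy", "medium", "hard"])
rng = random.Random(0)
chosen = bandit.sample(rng)
# In a real loop the advantages would come from the policy-gradient step.
bandit.update(chosen, [0.8, -0.6, 0.1])
```

Because the step size is constant, old rewards decay geometrically, which is what lets the values follow a non-stationary reward as the model improves.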
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning
Abstract
arXiv:2505.15400v1 Announce Type: new Abstract: Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover the phenomenon of "Internal Self-Recovery Mechanism" where models implicitly supplement reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with negligible performance sacrifice. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
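Summary
Large reasoning models (LRMs) achieve remarkable performance through long reasoning chains, but redundant reasoning, especially on simple tasks, often leads to excessive computational overhead. This study systematically quantifies the performance upper bounds of LRMs in the "Long-Thinking" and "No-Thinking" modes and reveals the "Internal Self-Recovery Mechanism", a phenomenon in which models implicitly supplement reasoning during answer generation. Building on this finding, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. Combined with accuracy-aware length reward regulation, it adaptively allocates reasoning effort according to problem difficulty and achieves efficient reasoning at a negligible cost in performance. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces the reasoning budget by up to 32.5% on the 1.5B model and 25.7% on the 7B model (losing only 1.2% and 0.6% pass@1 accuracy, respectively) and significantly raises harmless rates on safety benchmarks (up to +21.7%). The results demonstrate the potential of ASRR for efficient, adaptive, and safer reasoning in LRMs.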
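One possible reading of "accuracy-aware length reward regulation" is a length penalty that is switched on only when a group of sampled answers is already mostly correct. The sketch below illustrates that idea under stated assumptions; the threshold, the linear penalty shape, and the name `length_regulated_rewards` are all hypothetical and are not taken from the paper.

```python
from typing import List


def length_regulated_rewards(correct: List[bool], lengths: List[int],
                             acc_threshold: float = 0.75,
                             max_penalty: float = 0.5) -> List[float]:
    """Return one reward per sampled response for a single problem.

    Correct answers earn 1.0 and wrong answers 0.0. If the group accuracy
    reaches `acc_threshold` (the problem looks easy), each response is also
    penalised in proportion to how much longer it is than the shortest one.
    """
    base = [1.0 if c else 0.0 for c in correct]
    accuracy = sum(correct) / len(correct)
    if accuracy < acc_threshold:
        # Hard problem: leave long reasoning unpenalised.
        return base
    shortest = min(lengths)
    span = max(max(lengths) - shortest, 1)
    return [r - max_penalty * (n - shortest) / span
            for r, n in zip(base, lengths)]


easy = length_regulated_rewards([True, True, True, True], [100, 200, 300, 500])
hard = length_regulated_rewards([True, False, False, False], [100, 200, 300, 500])
```

In this toy setting the easy group is rewarded for brevity while the hard group keeps plain correctness rewards, which is the qualitative behavior the abstract describes.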
lmgame-Bench: How Good are LLMs at Playing Games?
Abstract
arXiv:2505.15146v1 Announce Type: new Abstract: Playing video games requires perception, memory, and planning, exactly the faculties modern large language model (LLM) agents are expected to master. We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons -- brittle vision perception, prompt sensitivity, and potential data contamination. We introduce lmgame-Bench to turn games into reliable evaluations. lmgame-Bench features a suite of platformer, puzzle, and narrative games delivered through a unified Gym-style API and paired with lightweight perception and memory scaffolds, and is designed to stabilize prompt variance and remove contamination. Across 13 leading models, we show lmgame-Bench is challenging while still separating models well. Correlation analysis shows that every game probes a unique blend of capabilities often tested in isolation elsewhere. More interestingly, performing reinforcement learning on a single game from lmgame-Bench transfers both to unseen games and to external planning tasks. Our evaluation code is available at https://github.com/lmgame-org/GamingAgent/lmgame-bench.
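Summary
Playing video games requires perception, memory, and planning, which are exactly the core abilities that modern large language model (LLM) agents are expected to master. This study analyzes the main challenges of using popular video games to evaluate modern LLMs and finds that dropping them directly into games does not yield an effective evaluation, for three reasons: brittle visual perception, prompt sensitivity, and potential data contamination. To address this, we introduce lmgame-Bench, an evaluation framework that uses standardized methods to turn games into reliable evaluation tools. The framework brings together platformer, puzzle, and narrative games, delivers them through a unified Gym-style API, and pairs them with lightweight perception and memory scaffolds, with the aim of stabilizing prompt variance and removing data contamination. Tests on 13 leading models show that lmgame-Bench is sufficiently challenging while still separating models well. Correlation analysis shows that each game probes a unique blend of model capabilities that other tests often examine in isolation. More interestingly, reinforcement learning on a single lmgame-Bench game transfers to unseen games and to external planning tasks. The evaluation code is open source: https://github.com/lmgame-org/GamingAgent/lmgame-bench.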
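For readers unfamiliar with the term, a "Gym-style API" means an environment that exposes `reset()` and `step(action)` and is driven by a simple loop. The sketch below shows that interaction pattern with a toy text game; it does not use lmgame-Bench's actual classes, and `CountdownEnv`, `run_episode`, and the agent functions are hypothetical stand-ins (the agent would be an LLM call in practice).

```python
from typing import Callable, Dict, Tuple


class CountdownEnv:
    """Toy game: subtract 1 or 2 per turn; landing exactly on 0 wins."""

    def __init__(self, start: int = 5, max_steps: int = 10) -> None:
        self.start = start
        self.max_steps = max_steps
        self.value = start
        self.steps = 0

    def _obs(self) -> str:
        return f"value={self.value}"

    def reset(self) -> Tuple[str, Dict]:
        self.value, self.steps = self.start, 0
        return self._obs(), {}

    def step(self, action: str) -> Tuple[str, float, bool, bool, Dict]:
        self.steps += 1
        self.value -= 2 if action == "two" else 1
        terminated = self.value <= 0
        reward = 1.0 if self.value == 0 else 0.0
        truncated = not terminated and self.steps >= self.max_steps
        return self._obs(), reward, terminated, truncated, {}


def run_episode(env: CountdownEnv, agent: Callable[[str], str]) -> float:
    """Play one episode and return the total reward."""
    obs, _ = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _ = env.step(agent(obs))
        total += reward
        done = terminated or truncated
    return total


def cautious_agent(obs: str) -> str:
    return "one"


def greedy_agent(obs: str) -> str:
    return "two"


cautious_score = run_episode(CountdownEnv(), cautious_agent)
greedy_score = run_episode(CountdownEnv(), greedy_agent)
```

Keeping every game behind the same two methods is what lets a single evaluation loop, and a single set of perception or memory scaffolds, be reused across very different games.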
ClickSight: Interpreting Student Clickstreams to Reveal Insights on Learning Strategies via LLMs
Abstract
arXiv:2505.15410v1 Announce Type: new Abstract: Clickstream data from digital learning environments offer valuable insights into students' learning behaviors, but are challenging to interpret due to their high dimensionality and granularity. Prior approaches have relied mainly on handcrafted features, expert labeling, clustering, or supervised models, therefore often lacking generalizability and scalability. In this work, we introduce ClickSight, an in-context Large Language Model (LLM)-based pipeline that interprets student clickstreams to reveal their learning strategies. ClickSight takes raw clickstreams and a list of learning strategies as input and generates textual interpretations of students' behaviors during interaction. We evaluate four different prompting strategies and investigate the impact of self-refinement on interpretation quality. Our evaluation spans two open-ended learning environments and uses a rubric-based domain-expert evaluation. Results show that while LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies by prompting strategy, and self-refinement offers limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.
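Summary
Clickstream data from digital learning environments offer valuable insight into students' learning behavior, but their high dimensionality and fine granularity make them hard to interpret. Existing approaches rely mainly on handcrafted feature engineering, expert labeling, clustering, or supervised models, and generally lack generalizability and scalability. This study proposes ClickSight, an in-context, large language model (LLM)-based analysis pipeline that interprets student clickstreams to reveal their learning strategies. The system takes raw clickstreams and a list of learning strategies as input and generates textual interpretations describing students' interaction behavior. We evaluate four different prompting strategies and examine the effect of self-refinement on interpretation quality. The experiments cover two open-ended learning environments and use a rubric-based evaluation by domain experts. The results show that although LLMs can reasonably interpret learning strategies from clickstreams, interpretation quality varies with the prompting strategy, and self-refinement brings only limited improvement. ClickSight demonstrates the potential of LLMs to generate theory-driven insights from educational interaction data.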
THELMA: Task Based Holistic Evaluation of Large Language Model Applications-RAG Question Answering
Abstract
arXiv:2505.11626v1 Announce Type: cross Abstract: We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free framework for RAG (Retrieval-Augmented Generation) based question answering (QA) applications. THELMA consists of six interdependent metrics specifically designed for holistic, fine-grained evaluation of RAG QA applications. The THELMA framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without requiring labelled sources or reference responses. We also present our findings on the interplay of the proposed THELMA metrics, which can be interpreted to identify the specific RAG component needing improvement in QA applications.
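Summary
We propose THELMA (Task Based Holistic Evaluation of Large Language Model Applications), a reference-free evaluation framework for Retrieval-Augmented Generation (RAG) question answering applications. THELMA comprises six interrelated metrics designed specifically for holistic, fine-grained evaluation of RAG QA applications. The framework helps developers and application owners evaluate, monitor, and improve end-to-end RAG QA pipelines without labelled data or reference answers. We also describe how the THELMA metrics interact; interpreting these interactions makes it possible to identify the specific RAG component that needs improvement in a QA application.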
Toward Open Earth Science as Fast and Accessible as Natural Language
Abstract
arXiv:2505.15690v1 Announce Type: new Abstract: Is natural-language-driven earth observation data analysis now feasible with the assistance of Large Language Models (LLMs)? For open science in service of public interest, feasibility requires reliably high accuracy, interactive latencies, low (sustainable) costs, open LLMs, and openly maintainable software -- hence, the challenge. What are the techniques and programming system requirements necessary for satisfying these constraints, and what is the corresponding development and maintenance burden in practice? This study lays the groundwork for exploring these questions, introducing an impactful earth science use-case, and providing a software framework with evaluation data and metrics, along with initial results from employing model scaling, prompt-optimization, and inference-time scaling optimization techniques. While we attain high accuracy (near 100%) across 10 of 11 metrics, the analysis further considers cost (token-spend), latency, and maintainability across this space of techniques. Finally, we enumerate opportunities for further research, general programming and evaluation framework development, and ongoing work for a comprehensive, deployable solution. This is a call for collaboration and contribution.
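Summary
With the assistance of large language models (LLMs), is natural-language-driven analysis of earth observation data now feasible? For open science in the public interest, feasibility requires consistently reliable high accuracy, low interactive latency, low (sustainable) cost, open LLMs, and openly maintainable software, and that is precisely the challenge. Which techniques and programming system requirements are needed to satisfy these constraints, and what is the development and maintenance burden in practice? This study lays the groundwork for exploring these questions: it introduces an impactful earth science use case and provides a software framework with evaluation data and metrics, together with initial results from model scaling, prompt optimization, and inference-time scaling optimization techniques. Although we achieve high accuracy (near 100%) on 10 of 11 metrics, the analysis also considers how these techniques perform in terms of cost (token spend), latency, and maintainability. Finally, we list directions for future research, opportunities for developing general programming and evaluation frameworks, and the follow-up work needed to build a comprehensive, deployable solution. This study is a call for collaboration and contribution.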
The Energy Cost of Reasoning: Analyzing Energy Usage in LLMs with Test-time Compute
Abstract
arXiv:2505.14733v1 Announce Type: cross Abstract: Scaling large language models (LLMs) has driven significant advancements, yet it faces diminishing returns and escalating energy demands. This work introduces test-time compute (TTC), the allocation of additional computational resources during inference, as a compelling complement to conventional scaling strategies. Specifically, we investigate whether employing TTC can achieve superior accuracy-energy trade-offs compared to simply increasing model size. Our empirical analysis reveals that TTC surpasses traditional model scaling in accuracy/energy efficiency, with notable gains in tasks demanding complex reasoning rather than mere factual recall. Further, we identify a critical interaction between TTC performance and output sequence length, demonstrating that strategically adjusting compute resources at inference time according to query complexity can substantially enhance efficiency. Our findings advocate for TTC as a promising direction, enabling more sustainable, accurate, and adaptable deployment of future language models without incurring additional pretraining costs.
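Summary
Scaling large language models (LLMs) has driven significant technical progress, but it faces diminishing returns and sharply rising energy consumption. This study proposes test-time compute (TTC), the dynamic allocation of additional computational resources at inference time, as an innovative complement to conventional scaling strategies. Our empirical analysis finds that, compared with simply increasing model size, TTC achieves a better trade-off in accuracy/energy efficiency, and it stands out on tasks that require complex reasoning rather than mere factual recall. We further find a critical interaction between TTC performance and output sequence length: strategically adjusting compute resources at inference time according to query complexity can substantially improve efficiency. This study argues for TTC as a new direction for deploying future language models, enabling more sustainable, accurate, and adaptive model applications without additional pretraining cost.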
LLINBO: Trustworthy LLM-in-the-Loop Bayesian Optimization
Abstract
arXiv:2505.14756v1 Announce Type: cross Abstract: Bayesian optimization (BO) is a sequential decision-making tool widely used for optimizing expensive black-box functions. Recently, Large Language Models (LLMs) have shown remarkable adaptability in low-data regimes, making them promising tools for black-box optimization by leveraging contextual knowledge to propose high-quality query points. However, relying solely on LLMs as optimization agents introduces risks due to their lack of explicit surrogate modeling and calibrated uncertainty, as well as their inherently opaque internal mechanisms. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off, ultimately undermining theoretical tractability and reliability. To address this, we propose LLINBO: LLM-in-the-Loop BO, a hybrid framework for BO that combines LLMs with statistical surrogate experts (e.g., Gaussian Processes (GP)). The core philosophy is to leverage contextual reasoning strengths of LLMs for early exploration, while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We end the paper with a real-life proof-of-concept in the context of 3D printing. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.
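Summary
Bayesian optimization (BO) is a sequential decision-making tool widely used to optimize expensive black-box functions. Recently, large language models (LLMs) have shown remarkable adaptability in low-data regimes, which makes them promising new tools for black-box optimization that draw on contextual knowledge to propose high-quality query points. Relying on LLMs alone as optimization agents is risky, however, because they lack explicit surrogate modeling and calibrated uncertainty and their internal mechanisms are inherently opaque. This structural opacity makes it difficult to characterize or control the exploration-exploitation trade-off and ultimately undermines theoretical tractability and reliability. We therefore propose LLINBO: LLM-in-the-Loop BO, a hybrid BO framework that combines LLMs with statistical surrogate experts such as Gaussian Processes (GP). The core idea is to use the contextual reasoning strengths of LLMs for early exploration while relying on principled statistical models to guide efficient exploitation. Specifically, we introduce three mechanisms that enable this collaboration and establish their theoretical guarantees. We conclude the paper with a real-world proof of concept in 3D printing. The code to reproduce the results is available at https://github.com/UMDataScienceLab/LLM-in-the-Loop-BO.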
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Abstract
arXiv:2505.14810v1 Announce Type: cross Abstract: Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.
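Summary
Instruction following is essential for aligning large language models (LLMs) with user intent. Although recent reasoning-oriented models perform impressively on complex mathematical problems, their ability to follow natural language instructions remains underexplored. This study proposes MathIF, a benchmark dedicated to evaluating instruction following in mathematical reasoning tasks. Our empirical analysis shows a persistent tension between scaling up reasoning capacity and maintaining controllability: models that reason more effectively often find it harder to follow user instructions. We find that models fine-tuned on distilled long chains of thought or trained with reasoning-oriented reinforcement learning usually decline in instruction adherence, especially as generation length grows. Moreover, even simple interventions can partially restore obedience, though at the expense of reasoning performance. These findings expose a fundamental tension in current LLM training paradigms and indicate the need for more instruction-aware reasoning models. The code and data are released at https://github.com/TingchenFu/MathIF.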
Text Generation Beyond Discrete Token Sampling
Abstract
arXiv:2505.14827v1 Announce Type: cross Abstract: In standard autoregressive generation, an LLM predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token as new input. To preserve this distribution's rich information, we propose Mixture of Inputs (MoI), a training-free method for autoregressive generation. After generating a token following the standard paradigm, we construct a new input that blends the generated discrete token with the previously discarded token distribution. Specifically, we employ a Bayesian estimation method that treats the token distribution as the prior, the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI allows the model to maintain a richer internal representation throughout the generation process, resulting in improved text quality and reasoning capabilities. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.
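Summary
In standard autoregressive generation, a large language model (LLM) predicts the next-token distribution, samples a discrete token, and then discards the distribution, passing only the sampled token on as the new input. To preserve the rich information in this distribution, we propose Mixture of Inputs (MoI), a training-free technique for autoregressive generation. After generating a token under the standard paradigm, the method constructs a new input that blends the generated discrete token with the token distribution that would previously have been discarded. Specifically, we use a Bayesian estimation method that treats the token distribution as the prior and the sampled token as the observation, and replaces the conventional one-hot vector with the continuous posterior expectation as the new model input. MoI lets the model maintain a richer internal representation throughout generation, which improves text quality and reasoning ability. On mathematical reasoning, code generation, and PhD-level QA tasks, MoI consistently improves the performance of multiple models, including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B, with no additional training and negligible computational overhead.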
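The blending step can be made concrete with a small worked example. The abstract does not state which estimator is used, so the sketch below assumes a Dirichlet prior whose concentration is `prior_strength` times the predicted distribution and a single categorical observation (the sampled token); the posterior mean then replaces the one-hot vector. The names `posterior_weights`, `mixed_input`, and `prior_strength` are hypothetical.

```python
from typing import List


def posterior_weights(probs: List[float], sampled: int,
                      prior_strength: float = 1.0) -> List[float]:
    """Posterior mean of a Dirichlet(prior_strength * probs) prior after
    observing the sampled token once (assumed estimator)."""
    denom = prior_strength + 1.0
    return [(prior_strength * p + (1.0 if i == sampled else 0.0)) / denom
            for i, p in enumerate(probs)]


def mixed_input(embeddings: List[List[float]],
                weights: List[float]) -> List[float]:
    """Weighted average of token embeddings, used in place of the
    embedding of the sampled token alone."""
    dim = len(embeddings[0])
    return [sum(w * row[d] for w, row in zip(weights, embeddings))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
probs = [0.5, 0.25, 0.25]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = posterior_weights(probs, sampled=1)
new_input = mixed_input(embeddings, weights)
```

As `prior_strength` approaches 0 the weights collapse to the usual one-hot vector, and as it grows they approach the full predicted distribution, so the parameter controls how much of the discarded information is carried forward.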
Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis
Abstract
arXiv:2505.14742v1 Announce Type: cross Abstract: Large language models (LLMs) have made exciting achievements across various domains, yet their deployment on resource-constrained personal devices remains hindered by the prohibitive computational and memory demands of task-specific fine-tuning. While quantization offers a pathway to efficiency, existing methods struggle to balance performance and overhead, either incurring high computational/memory costs or failing to address activation outliers, a critical bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): During fine-tuning, certain activation outlier channels retain stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a Quantized parameter-efficient fine-tuning framework for LLMs, optimizing low-precision activation representations through targeted momentum scaling. Quaff dynamically suppresses outliers exclusively in invariant channels using lightweight operations, eliminating full-precision weight storage and global rescaling while reducing quantization errors. Extensive experiments across ten benchmarks validate OSSH and demonstrate Quaff's efficacy. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, reconciling the triple trade-off between efficiency, performance, and deployability. By enabling consumer-grade GPU fine-tuning (e.g., RTX 2080 Super) without sacrificing model utility, Quaff democratizes personalized LLM deployment. The code is available at https://github.com/Little0o0/Quaff.git.
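Summary
Large language models (LLMs) have achieved impressive results across many domains, but their deployment on resource-constrained personal devices is still held back by the high compute and memory cost of task-specific fine-tuning. Although quantization offers a route to efficiency, existing methods struggle to balance performance and overhead: they either incur high compute/memory costs or fail to handle activation outliers, a key bottleneck in quantized fine-tuning. To address these challenges, we propose the Outlier Spatial Stability Hypothesis (OSSH): during fine-tuning, certain activation outlier channels keep stable spatial positions across training iterations. Building on OSSH, we propose Quaff, a quantized parameter-efficient fine-tuning framework that optimizes low-precision activation representations through targeted momentum scaling. Quaff uses lightweight operations to dynamically suppress outliers in the invariant channels, with no need for full-precision weight storage or global rescaling, while reducing quantization error. Extensive experiments on ten benchmarks validate OSSH and demonstrate Quaff's effectiveness. Specifically, on the GPQA reasoning benchmark, Quaff achieves a 1.73x latency reduction and 30% memory savings relative to full-precision fine-tuning while improving accuracy by 0.6% on the Phi-3 model, balancing efficiency, performance, and deployability. By supporting fine-tuning on consumer-grade GPUs (e.g., RTX 2080 Super) without sacrificing model utility, Quaff makes personalized LLM deployment more widely accessible. The code is open source: https://github.com/Little0o0/Quaff.git.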
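Why a single outlier channel hurts low-precision activations, and why scaling only that channel helps, can be shown with a toy example. The sketch below is not Quaff's implementation: it assumes symmetric per-tensor int8 quantization, a fixed set of outlier channels (in the spirit of OSSH), and a hypothetical momentum rule `update_scale` for tracking each channel's scale.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def update_scale(prev_scale: float, observed_peak: float,
                 target_peak: float, momentum: float = 0.9) -> float:
    """Momentum update of one outlier channel's scale (hypothetical rule)."""
    wanted = max(observed_peak / target_peak, 1.0)
    return momentum * prev_scale + (1.0 - momentum) * wanted


def quantize_int8(acts: Matrix,
                  scales: Dict[int, float]) -> Tuple[List[List[int]], float]:
    """Divide the listed channels by their scale, then quantize the whole
    tensor symmetrically to the int8 range [-127, 127]."""
    scaled = [[x / scales.get(ch, 1.0) for ch, x in enumerate(row)]
              for row in acts]
    peak = max(abs(x) for row in scaled for x in row)
    step = peak / 127.0 if peak > 0 else 1.0
    return [[round(x / step) for x in row] for row in scaled], step


def dequantize(q: List[List[int]], step: float,
               scales: Dict[int, float]) -> Matrix:
    """Undo the quantization and the per-channel scaling."""
    return [[v * step * scales.get(ch, 1.0) for ch, v in enumerate(row)]
            for row in q]


def max_error(a: Matrix, b: Matrix, channels: List[int]) -> float:
    """Largest absolute reconstruction error over the given channels."""
    return max(abs(ra[ch] - rb[ch]) for ra, rb in zip(a, b) for ch in channels)


acts = [[1.0, -0.5, 640.0], [0.5, 1.0, -320.0]]  # channel 2 is the outlier
normal = [0, 1]

plain_q, plain_step = quantize_int8(acts, {})
plain_err = max_error(acts, dequantize(plain_q, plain_step, {}), normal)

scales = {2: 640.0}  # assumed value the momentum rule has settled on
scaled_q, scaled_step = quantize_int8(acts, scales)
scaled_err = max_error(acts, dequantize(scaled_q, scaled_step, scales), normal)
```

Without scaling, the outlier sets the quantization step and the ordinary channels round to zero; with the outlier channel scaled down, the step shrinks and those channels are reconstructed closely. The error is measured on the ordinary channels only, since the outlier channel keeps roughly the same relative precision either way.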
A Comparative Study of Large Language Models and Human Personality Traits
Abstract
arXiv:2505.14845v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated human-like capabilities in language comprehension and generation, becoming active participants in social and cognitive domains. This study investigates whether LLMs exhibit personality-like traits and how these traits compare with human personality, focusing on the applicability of conventional personality assessment tools. A behavior-based approach was used across three empirical studies. Study 1 examined test-retest stability and found that LLMs show higher variability and are more input-sensitive than humans, lacking long-term stability. Based on this, we propose the Distributed Personality Framework, conceptualizing LLM traits as dynamic and input-driven. Study 2 analyzed cross-variant consistency in personality measures and found LLMs' responses were highly sensitive to item wording, showing low internal consistency compared to humans. Study 3 explored personality retention during role-playing, showing LLM traits are shaped by prompt and parameter settings. These findings suggest that LLMs express fluid, externally dependent personality patterns, offering insights for constructing LLM-specific personality frameworks and advancing human-AI interaction. This work contributes to responsible AI development and extends the boundaries of personality psychology in the age of intelligent systems.
摘要
大语言模型(LLMs)在语言理解与生成方面展现出类人能力,已成为社会和认知领域的积极参与者。本研究探讨LLMs是否表现出类人格特质,以及这些特质与人类人格的异同,重点关注传统人格评估工具的适用性。通过三项实证研究采用基于行为的方法:研究1检验了重测稳定性,发现LLMs较人类表现出更高变异性和输入敏感性,缺乏长期稳定性。据此我们提出分布式人格框架,将LLM特质概念化为动态且输入驱动的属性。研究2分析了人格测量的跨版本一致性,发现LLMs的响应高度依赖条目措辞,与人类相比内部一致性较低。研究3探究角色扮演中的人格保持性,显示LLM特质受提示语和参数设置塑造。这些发现表明LLMs表达出流动的、外部依赖性的人格模式,为构建LLM专用人格框架和推进人机交互提供了新见解。本工作有助于负责任AI发展,并拓展了智能系统时代人格心理学的边界。
MAATS: A Multi-Agent Automated Translation System Based on MQM Evaluation
Abstract
arXiv:2505.14848v1 Announce Type: cross Abstract: We present MAATS, a Multi Agent Automated Translation System that leverages the Multidimensional Quality Metrics (MQM) framework as a fine-grained signal for error detection and refinement. MAATS employs multiple specialized AI agents, each focused on a distinct MQM category (e.g., Accuracy, Fluency, Style, Terminology), followed by a synthesis agent that integrates the annotations to iteratively refine translations. This design contrasts with conventional single-agent methods that rely on self-correction. Evaluated across diverse language pairs and Large Language Models (LLMs), MAATS outperforms zero-shot and single-agent baselines with statistically significant gains in both automatic metrics and human assessments. It excels particularly in semantic accuracy, locale adaptation, and linguistically distant language pairs. Qualitative analysis highlights its strengths in multi-layered error diagnosis, omission detection across perspectives, and context-aware refinement. By aligning modular agent roles with interpretable MQM dimensions, MAATS narrows the gap between black-box LLMs and human translation workflows, shifting focus from surface fluency to deeper semantic and contextual fidelity.
摘要
我们提出MAATS(多智能体自动翻译系统),该系统利用多维质量指标(MQM)框架作为细粒度信号进行错误检测与优化。MAATS采用多个专用AI智能体,每个智能体专注于特定MQM类别(如准确性、流畅性、风格、术语),再由合成智能体整合标注以迭代优化翻译。这种设计与依赖自我校正的传统单智能体方法形成鲜明对比。
通过在多语言对和大语言模型(LLM)上的评估,MAATS在自动指标和人工评估中均显著优于零样本和单智能体基线,尤其在语义准确性、地域适应性及语言距离较远的语对中表现突出。定性分析表明其优势体现在多层错误诊断、多视角遗漏检测及上下文感知优化方面。通过将模块化智能体角色与可解释的MQM维度对齐,MAATS缩小了黑盒LLM与人工翻译流程间的差距,将优化重点从表层流畅性转向更深层的语义与上下文保真度。
WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
Abstract
arXiv:2505.14818v1 Announce Type: cross Abstract: Robustly evaluating the long-form storytelling capabilities of Large Language Models (LLMs) remains a significant challenge, as existing benchmarks often lack the necessary scale, diversity, or objective measures. To address this, we introduce WebNovelBench, a novel benchmark specifically designed for evaluating long-form novel generation. WebNovelBench leverages a large-scale dataset of over 4,000 Chinese web novels, framing evaluation as a synopsis-to-story generation task. We propose a multi-faceted framework encompassing eight narrative quality dimensions, assessed automatically via an LLM-as-Judge approach. Scores are aggregated using Principal Component Analysis and mapped to a percentile rank against human-authored works. Our experiments demonstrate that WebNovelBench effectively differentiates between human-written masterpieces, popular web novels, and LLM-generated content. We provide a comprehensive analysis of 24 state-of-the-art LLMs, ranking their storytelling abilities and offering insights for future development. This benchmark provides a scalable, replicable, and data-driven methodology for assessing and advancing LLM-driven narrative generation.
摘要
稳健评估大语言模型(LLMs)的长篇叙事能力仍存在重大挑战,现有基准测试往往缺乏必要的规模、多样性或客观衡量标准。为此,我们提出WebNovelBench——一个专为评估长篇小说生成而设计的新型基准。该基准利用包含4,000余部中文网络小说的大规模数据集,将评估任务构建为'概要到故事'的生成框架。我们提出一个包含八个叙事质量维度的多层面评估体系,通过'LLM即评委'方法实现自动化测评,并采用主成分分析法聚合分数后映射至人类作品的百分位排名。实验表明,WebNovelBench能有效区分人类创作的经典作品、流行网络小说与LLM生成内容。我们对24个前沿LLM进行了全面分析,排序其叙事能力并为未来发展提供洞见。该基准为评估和推进LLM驱动的叙事生成提供了可扩展、可复现且数据驱动的方法论。
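The aggregation step described above (per-dimension judge scores, combined with Principal Component Analysis, then mapped to a percentile against human-authored works) can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the synthetic score matrix, the use of only the first principal component, and the helper names `fit_first_component`, `aggregate`, and `percentile_rank` are all assumptions made here.

```python
import numpy as np

def fit_first_component(scores):
    """Return the column means and first principal axis of a (n_works, n_dims) matrix."""
    mean = scores.mean(axis=0)
    _, _, vt = np.linalg.svd(scores - mean, full_matrices=False)
    component = vt[0]
    # The sign of a principal axis is arbitrary; orient it so higher scores rank higher.
    if component.sum() < 0:
        component = -component
    return mean, component

def aggregate(scores, mean, component):
    """Project centered scores onto the first principal axis."""
    return (scores - mean) @ component

def percentile_rank(value, reference):
    """Percentage of reference values that are less than or equal to `value`."""
    return 100.0 * float(np.mean(reference <= value))

# Hypothetical data: 200 human-authored works scored on eight dimensions,
# driven by one shared quality factor plus noise.
rng = np.random.default_rng(0)
quality = rng.uniform(3.0, 8.0, size=200)
human_scores = quality[:, None] + rng.normal(0.0, 0.5, size=(200, 8))

mean, component = fit_first_component(human_scores)
human_aggregate = aggregate(human_scores, mean, component)

# A hypothetical LLM-generated story that scores 6 on every dimension.
candidate = np.full(8, 6.0)
rank = percentile_rank(aggregate(candidate, mean, component), human_aggregate)
print(f"percentile rank against human works: {rank:.1f}")
```

Fitting the projection on human works only keeps the reference distribution fixed, so the percentile of a model-generated story does not shift when more models are added.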
Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity
Abstract
arXiv:2505.14884v1 Announce Type: cross Abstract: Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters, shows promise but does not scale to large batch sizes because the union of active neurons across the batch quickly approaches dense computation. We introduce Polar Sparsity, highlighting a key shift in sparsity importance from MLP to Attention layers as we scale batch size and sequence length. While MLP layers become more compute-efficient under batching, their sparsity vanishes. In contrast, attention becomes increasingly expensive at scale, while its head sparsity remains stable and batch-invariant. We develop hardware-efficient, sparsity-aware GPU kernels for selective MLP and Attention computations, delivering up to 2.2x end-to-end speedups for models like OPT, LLaMA-2 & 3, across various batch sizes and sequence lengths without compromising accuracy. To our knowledge, this is the first work to demonstrate that contextual sparsity can scale effectively to large batch sizes, delivering substantial inference acceleration with minimal changes, making Polar Sparsity practical for large-scale, high-throughput LLM deployment systems. Our code is available at: https://github.com/susavlsh10/Polar-Sparsity.
摘要
加速大型语言模型(LLM)推理对于需要高吞吐量和低延迟的实际部署至关重要。上下文稀疏性(即每个令牌动态激活仅一小部分模型参数)虽展现出潜力,但由于活跃神经元的并集迅速逼近密集计算,该方法难以扩展至大批量场景。我们提出极性稀疏性,揭示了当批量大小与序列长度增加时,稀疏性重要性从MLP层向注意力层的关键转变:MLP层在批处理下计算效率提升但其稀疏性消失,而注意力计算成本随规模增长显著增加,其头部稀疏性却保持稳定且与批量无关。我们开发了硬件高效的稀疏感知GPU内核,用于选择性MLP和注意力计算,在保持精度前提下为OPT、LLaMA-2/3等模型在不同批量与序列长度下带来最高2.2倍的端到端加速。据我们所知,这是首个证明上下文稀疏性可有效扩展至大批量的研究,通过极简修改实现显著推理加速,使极性稀疏性适用于大规模高吞吐LLM部署系统。代码已开源:https://github.com/susavlsh10/Polar-Sparsity。
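The claim that the union of active neurons approaches dense computation as the batch grows can be checked with a short calculation. The sketch below assumes each token activates any given neuron independently with probability `p`; that independence is a simplification introduced here, not something the abstract states, and real activations are correlated.

```python
def expected_union_density(p, batch_size):
    """Expected fraction of neurons needed by at least one token in the batch.

    Assumes every token activates each neuron independently with probability p.
    """
    return 1.0 - (1.0 - p) ** batch_size

# Hypothetical per-token sparsity level: 5% of MLP neurons active per token.
per_token_density = 0.05
for batch_size in (1, 8, 64, 256):
    density = expected_union_density(per_token_density, batch_size)
    print(f"batch={batch_size:4d}  union density={density:.3f}")
```

Under this simplified model, a layer that is 95% sparse for a single token needs most of its neurons once the batch holds a few dozen tokens, which is the scaling problem the abstract describes for MLP sparsity.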
Soft Prompts for Evaluation: Measuring Conditional Distance of Capabilities
Abstract
arXiv:2505.14943v1 Announce Type: cross Abstract: To help evaluate and understand the latent capabilities of language models, this paper introduces an approach using optimized input embeddings, or 'soft prompts,' as a metric of conditional distance between a model and a target behavior. The technique aims to facilitate latent capability discovery as a part of automated red teaming/evaluation suites and to provide quantitative feedback about the accessibility of potentially concerning behaviors in a way that may scale to powerful future models, including those which may otherwise be capable of deceptive alignment. An evaluation framework using soft prompts is demonstrated in natural language, chess, and pathfinding, and the technique is extended with generalized conditional soft prompts to aid in constructing task evaluations.
摘要
为帮助评估和理解语言模型的潜在能力,本文提出一种采用优化输入嵌入(即"软提示")作为模型与目标行为间条件距离度量指标的方法。该技术旨在将潜在能力发现作为自动化红队测试/评估套件的组成部分,并通过可量化的反馈机制评估潜在风险行为的可及性,这种方法未来可扩展至更强大的模型(包括那些可能具备欺骗性对齐能力的模型)。研究通过自然语言处理、国际象棋和路径规划三个领域展示了基于软提示的评估框架,并进一步提出广义条件软提示技术以辅助构建任务评估体系。
Scaling Laws for State Dynamics in Large Language Models
Abstract
arXiv:2505.14892v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly used in tasks requiring internal state tracking, yet their ability to model state transition dynamics remains poorly understood. We evaluate how well LLMs capture deterministic state dynamics across 3 domains: Box Tracking, Abstract DFA Sequences, and Complex Text Games, each formalizable as a finite-state system. Across tasks, we find that next-state prediction accuracy degrades with increasing state-space size and sparse transitions. GPT-2 XL reaches about 70% accuracy in low-complexity settings but drops below 30% when the number of boxes or states exceeds 5 or 10, respectively. In DFA tasks, Pythia-1B fails to exceed 50% accuracy when the number of states is > 10 and transitions are < 30. Through activation patching, we identify attention heads responsible for propagating state information: GPT-2 XL Layer 22 Head 20, and Pythia-1B Heads at Layers 10, 11, 12, and 14. While these heads successfully move relevant state features, action information is not reliably routed to the final token, indicating weak joint state-action reasoning. Our results suggest that state tracking in LLMs emerges from distributed interactions of next-token heads rather than explicit symbolic computation.
摘要
大型语言模型(LLMs)在需要内部状态追踪的任务中应用日益广泛,但其对状态转移动态的建模能力仍不甚明晰。本研究评估了LLMs在三个可形式化为有限状态系统的领域(方块追踪、抽象DFA序列和复杂文本游戏)中捕捉确定性状态动态的能力。实验发现:跨任务场景下,下一状态预测准确率随状态空间规模扩大和转移稀疏性增加而下降。GPT-2 XL在低复杂度环境下可达约70%准确率,但当方块数量或状态数分别超过5或10时,准确率降至30%以下;在DFA任务中,当状态数>10且转移数<30时,Pythia-1B模型准确率无法突破50%。通过激活修补技术,我们识别出负责状态信息传播的注意力头:GPT-2 XL第22层第20号头,以及Pythia-1B第10、11、12和14层的注意力头。虽然这些头能成功传递相关状态特征,但动作信息未被可靠路由至最终标记,表明联合状态-动作推理能力薄弱。研究结果表明,LLMs中的状态追踪源于下一标记头的分布式交互,而非显式的符号计算。
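The DFA setting above (number of states, number of defined transitions, next-state prediction) can be made concrete with a small generator. This is a sketch under assumptions: the function names, the random construction, and the convention that an undefined transition leaves the state unchanged are choices made here, not details from the paper.

```python
import random

def make_dfa(num_states, alphabet, num_transitions, seed=0):
    """Build a random partial DFA as a dict mapping (state, symbol) to next state."""
    rng = random.Random(seed)
    pairs = [(state, symbol) for state in range(num_states) for symbol in alphabet]
    chosen = rng.sample(pairs, num_transitions)
    return {pair: rng.randrange(num_states) for pair in chosen}

def run(dfa, start, symbols):
    """Return the state reached from `start` after reading `symbols`."""
    state = start
    for symbol in symbols:
        # Assumption: a missing transition leaves the state unchanged.
        state = dfa.get((state, symbol), state)
    return state

# A sparse setting similar to the one the abstract reports as hard:
# more than 10 states and fewer than 30 transitions.
dfa = make_dfa(num_states=12, alphabet="abc", num_transitions=25, seed=1)
sequence = "abcabca"
print("final state:", run(dfa, start=0, symbols=sequence))
```

An evaluation item would then pair a textual rendering of the transitions and the symbol sequence with the output of `run` as the gold next state.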
Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels
Abstract
arXiv:2505.14925v1 Announce Type: cross Abstract: Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.
摘要
尽管大型语言模型(LLMs)的上下文长度已扩展至数百万标记,但评估其在"大海捞针"式测试之外的有效性仍具挑战。我们认为小说可作为研究复杂精细结构及长程语义依赖(通常超过128k标记)的理想案例。受计算小说分析研究的启发,我们发布了"太长未建模"(TLDM)基准测试,用于评估模型在情节摘要复述、故事世界构型识别及叙事时间跨度推算方面的能力。测试发现,七款前沿LLMs在超过64k标记后均无法保持稳定理解。结果表明,语言模型开发者必须超越"中间丢失"类基准,才能准确评估模型在复杂长上下文场景中的表现。为促进后续研究,我们同步公开了TLDM基准测试及其参考代码与数据集。
JARVIS: A Multi-Agent Code Assistant for High-Quality EDA Script Generation
Abstract
arXiv:2505.14978v1 Announce Type: cross Abstract: This paper presents JARVIS, a novel multi-agent framework that leverages Large Language Models (LLMs) and domain expertise to generate high-quality scripts for specialized Electronic Design Automation (EDA) tasks. By combining a domain-specific LLM trained with synthetically generated data, a custom compiler for structural verification, rule enforcement, code fixing capabilities, and advanced retrieval mechanisms, our approach achieves significant improvements over state-of-the-art domain-specific models. Our framework addresses the challenges of data scarcity and hallucination errors in LLMs, demonstrating the potential of LLMs in specialized engineering domains. We evaluate our framework on multiple benchmarks and show that it outperforms existing models in terms of accuracy and reliability. Our work sets a new precedent for the application of LLMs in EDA and paves the way for future innovations in this field.
摘要
本文提出JARVIS——一种新型多智能体框架,该框架通过结合大语言模型(LLM)与领域专业知识,为专用电子设计自动化(EDA)任务生成高质量脚本。我们的方法整合了基于合成数据训练的领域专用LLM、用于结构验证的自定义编译器、规则强制执行与代码修复功能以及高级检索机制,相比当前最先进的领域专用模型取得了显著改进。该框架有效解决了LLM在专业工程领域中面临的数据稀缺和幻觉错误等挑战,彰显了LLM在专业工程领域的应用潜力。我们在多个基准测试上评估了该框架,结果表明其在准确性和可靠性方面均优于现有模型。本研究为LLM在EDA领域的应用树立了新标杆,并为该领域的未来创新奠定了基础。
Programmatic Video Prediction Using Large Language Models
Abstract
arXiv:2505.14948v1 Announce Type: cross Abstract: The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes. For applications such as video surveillance, robotics, and autonomous driving, this objective entails synthesizing plausible visual futures, given a few frames of a video to set the visual context. Towards this end, we propose ProgGen, which undertakes the task of video frame prediction by representing the dynamics of the video using a neuro-symbolic, human-interpretable set of states (one per frame) by leveraging the inductive biases of Large (Vision) Language Models (LLM/VLM). In particular, ProgGen utilizes LLM/VLM to synthesize programs: (i) to estimate the states of the video, given the visual context (i.e. the frames); (ii) to predict the states corresponding to future time steps by estimating the transition dynamics; (iii) to render the predicted states as visual RGB-frames. Empirical evaluations reveal that our proposed method outperforms competing techniques at the task of video frame prediction in two challenging environments: (i) PhyWorld and (ii) Cart Pole. Additionally, ProgGen permits counter-factual reasoning and interpretable video generation, attesting to its effectiveness and generalizability for video generation tasks.
摘要
估计描述现实世界过程动态的世界模型这一任务,对于预测和准备未来结果具有重要意义。在视频监控、机器人应用、自动驾驶等应用中,该目标需要在给定少量视频帧设定视觉上下文的情况下,合成合理的视觉未来。为此,我们提出了ProgGen,它通过利用大型(视觉)语言模型(LLM/VLM)的归纳偏置,将视频动态表示为一组神经符号化、人类可解释的状态(每帧一个状态),从而完成视频帧预测任务。具体而言,ProgGen利用LLM/VLM合成程序:(i)在给定视觉上下文(即帧)的情况下估计视频状态;(ii)通过估计过渡动态预测未来时间步对应的状态;(iii)将预测状态渲染为视觉RGB帧。实证评估表明,我们提出的方法在两个具有挑战性的环境(i)PhyWorld和(ii)Cart Pole中,在视频帧预测任务上优于竞争技术。此外,ProgGen支持反事实推理和可解释的视频生成,证明了其在视频生成任务中的有效性和泛化能力。
STree: Speculative Tree Decoding for Hybrid State-Space Models
Abstract
arXiv:2505.14969v1 Announce Type: cross Abstract: Speculative decoding is a technique to leverage hardware concurrency to improve the efficiency of large-scale autoregressive (AR) Transformer models by enabling multiple steps of token generation in a single forward pass. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead to current SSM state update implementations. With the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code will be released upon paper acceptance.
摘要
推测解码是一种利用硬件并发性提升自回归(AR)Transformer模型效率的技术,通过单次前向传播实现多步令牌生成。状态空间模型(SSMs)本身已比AR Transformer更高效,因其状态可汇总所有历史数据,无需缓存或重新处理滑动窗口上下文中的令牌。然而,其状态也可能包含数千个令牌,因此推测解码技术近期被扩展至SSMs领域。但现有方法未能利用基于树的验证机制,因当前SSMs缺乏高效计算令牌树的方法。我们提出首个可扩展算法,用于在状态空间模型(SSMs)及SSM与Transformer层的混合架构中实现基于树的推测解码。通过利用累积状态转移矩阵的结构,我们在现有SSM状态更新实现上以最小开销实现了基于树的推测解码。基于该算法,我们提出一种硬件感知的实现方案,改进了AR Transformer树基推测解码方法在SSMs中的直接应用。实验表明,在三个不同基准测试中,即使采用基线草稿模型和树结构,我们的方法仍优于SSMs的原始推测解码,为SSM及混合模型推理的进一步加速开辟了新途径。代码将在论文录用后公开。
Meta-Design Matters: A Self-Design Multi-Agent System
Abstract
arXiv:2505.14996v1 Announce Type: cross Abstract: Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks. However, most current MAS depend on manually designed agent roles and communication protocols. These manual designs often fail to align with the underlying LLMs' strengths and struggle to adapt to novel tasks. Recent automatic MAS approaches attempt to mitigate these limitations but typically necessitate a validation-set for tuning and yield static MAS designs lacking adaptability during inference. We introduce SELF-MAS, the first self-supervised, inference-time only framework for automatic MAS design. SELF-MAS employs meta-level design to iteratively generate, evaluate, and refine MAS configurations tailored to each problem instance, without requiring a validation set. Critically, it enables dynamic agent composition and problem decomposition through meta-feedback on solvability and completeness. Experiments across math, graduate-level QA, and software engineering benchmarks, using both closed-source and open-source LLM back-bones of varying sizes, demonstrate that SELF-MAS outperforms both manual and automatic MAS baselines, achieving a 7.44% average accuracy improvement over the next strongest baseline while maintaining cost-efficiency. These findings underscore the promise of meta-level self-supervised design for creating effective and adaptive MAS.
摘要
利用大型语言模型(LLM)强大能力的多智能体系统(MAS)在解决复杂任务方面具有重要潜力。然而,当前大多数MAS依赖于人工设计的智能体角色与通信协议。这些人工设计往往无法充分发挥底层LLM的优势,且难以适应新任务。近期自动化的MAS方法试图缓解这些限制,但通常需要验证集进行调优,并产生缺乏推理阶段适应性的静态MAS设计。我们提出SELF-MAS——首个仅需推理阶段的自监督自动化MAS设计框架。该方法通过元级设计迭代生成、评估并优化针对每个问题实例的MAS配置,无需验证集。其核心在于通过可解性与完备性的元反馈,实现动态的智能体组合与问题分解。在数学、研究生水平QA及软件工程基准测试上的实验表明(使用不同规模的闭源与开源LLM骨干模型),SELF-MAS在保持成本效益的同时,其性能优于人工与自动化MAS基线方法,平均准确率较次优基线提升7.44%。这些发现印证了元级自监督设计在构建高效自适应MAS方面的潜力。
Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision
Abstract
arXiv:2505.14999v1 Announce Type: cross Abstract: Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain of Thought (CoT) prompting elicits reasoning steps, it doesn't guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates so that lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute force sampling, thereby enhancing LLM reasoning outcome reliability through its streamlined post hoc verification process.
摘要
数学推理对大型语言模型(LLM)构成重大挑战,通常需要强大的多步逻辑一致性。虽然思维链(CoT)提示能够引发推理步骤,但不能保证正确性,而通过大量采样提高可靠性又会导致计算成本高昂。本文提出能量结果奖励模型(EORM),一种高效、轻量级的事后验证器。EORM基于能量模型(EBM),通过学习仅使用结果标签为CoT解决方案分配标量能量分数,简化了奖励模型的训练,从而避免了详细标注。其实现方式是将判别器输出逻辑值解释为负能量,有效对候选方案进行排序——为导致正确最终结果的解决方案分配较低能量,隐式地偏好连贯推理。在数学基准测试(GSM8k、MATH)上,EORM显著提高了最终答案准确率(例如,使用Llama 3 8B模型时,在GSM8k上达到90.7%,在MATH上达到63.7%)。EORM能有效利用给定的候选解决方案池,达到或超越暴力采样的性能,从而通过其高效的事后验证流程提升LLM推理结果的可靠性。
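The ranking rule described above (a discriminator logit read as a negative energy, with the lowest-energy candidate preferred) can be sketched as follows. The candidate strings and logit values are hypothetical, and the function names are illustrative; the trained verifier that would produce the logits is not shown.

```python
def energies_from_logits(logits):
    """Interpret each discriminator logit as a negative energy: E = -logit."""
    return [-logit for logit in logits]

def rank_by_energy(candidates, logits):
    """Return (candidate, energy) pairs sorted from lowest to highest energy."""
    energies = energies_from_logits(logits)
    order = sorted(range(len(candidates)), key=lambda index: energies[index])
    return [(candidates[index], energies[index]) for index in order]

# Hypothetical pool of sampled chain-of-thought solutions with verifier logits.
candidates = ["solution A", "solution B", "solution C"]
logits = [0.2, 2.5, -1.0]

ranked = rank_by_energy(candidates, logits)
best_candidate, best_energy = ranked[0]
print(best_candidate, best_energy)
```

Because only the ordering matters at selection time, the verifier can be applied post hoc to any fixed pool of sampled solutions.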
Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Abstract
arXiv:2505.15038v1 Announce Type: cross Abstract: Linear Concept Vectors have proven effective for steering large language models (LLMs). While existing approaches like linear probing and difference-in-means derive these vectors from LLM hidden representations, diverse data introduces noises (i.e., irrelevant features) that challenge steering robustness. To address this, we propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which uses Sparse Autoencoders to filter out noisy features from hidden representations. When applied to linear probing and difference-in-means, our method improves their steering success rates. We validate our noise hypothesis through counterfactual experiments and feature visualizations.
摘要
线性概念向量已被证明能有效引导大语言模型(LLMs)。现有方法如线性探测和均值差异从LLM隐藏表示中推导这些向量,但多样化数据会引入噪声(即无关特征),影响引导的稳健性。为此,我们提出稀疏自编码去噪概念向量(SDCV),利用稀疏自编码器从隐藏表示中滤除噪声特征。当应用于线性探测和均值差异方法时,本方法显著提升了其引导成功率。我们通过反事实实验和特征可视化验证了噪声假设的合理性。
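A toy version of the idea above is sketched below: compute a difference-in-means concept vector, then recompute it after passing hidden states through a sparse autoencoder with only selected features kept. Everything specific is an assumption made for illustration: the identity-weight autoencoder, the rule that keeps the `k` features with the largest activation gap between the two sets, and the synthetic data. The abstract does not say how SDCV selects features.

```python
import numpy as np

def difference_in_means(pos, neg):
    """Concept vector: mean hidden state with the concept minus mean without it."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def sae_encode(hidden, w_enc, b_enc):
    """ReLU encoder of a sparse autoencoder."""
    return np.maximum(hidden @ w_enc + b_enc, 0.0)

def select_features(pos_codes, neg_codes, k):
    """Assumed rule: keep the k features whose mean activation differs most."""
    gap = np.abs(pos_codes.mean(axis=0) - neg_codes.mean(axis=0))
    return np.argsort(gap)[-k:]

def denoise(hidden, w_enc, b_enc, w_dec, keep):
    """Reconstruct hidden states from the kept features only."""
    codes = sae_encode(hidden, w_enc, b_enc)
    mask = np.zeros(codes.shape[-1])
    mask[keep] = 1.0
    return (codes * mask) @ w_dec

# Synthetic hidden states: dimension 0 carries the concept, the rest is noise.
rng = np.random.default_rng(0)
dim = 4
pos = rng.uniform(0.0, 1.0, size=(50, dim))
neg = rng.uniform(0.0, 1.0, size=(50, dim))
pos[:, 0] = 1.0
neg[:, 0] = 0.0

# Hypothetical autoencoder with identity weights, so features equal dimensions.
w_enc, b_enc, w_dec = np.eye(dim), np.zeros(dim), np.eye(dim)

keep = select_features(sae_encode(pos, w_enc, b_enc), sae_encode(neg, w_enc, b_enc), k=1)
raw_vector = difference_in_means(pos, neg)
clean_vector = difference_in_means(
    denoise(pos, w_enc, b_enc, w_dec, keep),
    denoise(neg, w_enc, b_enc, w_dec, keep),
)
print("raw:  ", np.round(raw_vector, 3))
print("clean:", np.round(clean_vector, 3))
```

In this toy case the raw vector carries small nonzero components on the noise dimensions, while the denoised vector points only along the concept dimension.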
One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
Abstract
arXiv:2505.15009v1 Announce Type: cross Abstract: We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, neither convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at a linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples and exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.
摘要
我们研究了一层Transformer模型在无噪声和有噪声上下文推理中进行下一词预测时的近似能力与收敛行为。现有理论成果主要关注对首次梯度步长或无限样本情况下的上下文推理行为的理解,且尚未涉及收敛速率或泛化能力的分析。本研究通过证明存在一类具有线性注意力机制和ReLU注意力机制的单层Transformer可被理论证实为贝叶斯最优,填补了这些空白。在梯度下降训练过程中,我们通过有限样本分析表明这些Transformer的期望损失以线性速率收敛至贝叶斯风险。此外,我们证明了训练后的模型不仅能泛化到未见样本,还展现出先前实证研究中观察到的学习行为。大量实证验证进一步支持了我们的理论发现。
Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Abstract
arXiv:2505.15075v1 Announce Type: cross Abstract: The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
摘要
多模态大语言模型(MLLMs)的快速发展显著提升了其现实应用能力。然而,在跨语言场景下(尤其是涉及文化知识整合时)保持性能一致性仍存在重大挑战。为系统评估该问题,我们提出两个新基准:KnowRecall和VisRecall,用于评测MLLMs的跨语言一致性。KnowRecall作为视觉问答基准,通过15种语言测试全球地标的文化历史类问题,衡量事实知识一致性;VisRecall则要求模型在无图像条件下用9种语言描述地标外观,评估视觉记忆一致性。实验表明,包括商业模型在内的最先进MLLMs仍难以实现跨语言一致性,这凸显了需要开发更健壮的方法来构建真正多语言且具备文化认知能力的模型。
RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Abstract
arXiv:2505.15034v1 Announce Type: cross Abstract: Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
摘要
强化学习(RL)近期已成为提升大语言模型(LLM)推理能力的重要方法,其核心是通过验证器(奖励模型)引导作为策略的LLM生成器。然而,当前LLM的RL后训练方法通常采用固定验证器(基于规则或冻结预训练模型)或通过监督微调(SFT)训练的判别式验证器。此类设计易受奖励破解影响,且在训练分布之外泛化能力较差。为突破这些限制,我们提出Tango框架——一种通过RL交替训练LLM生成器与验证器的新方法。Tango的核心创新在于其生成式、过程级LLM验证器,该验证器通过RL训练并与生成器协同进化。值得注意的是,验证器仅基于结果级验证正确性奖励进行训练,无需显式过程级标注。与确定性或SFT训练的验证器相比,这种RL训练的生成式验证器展现出更强的鲁棒性和泛化能力,有效促进与生成器的双向增强。大量实验表明,Tango的双组件在7B/8B规模模型中均取得最先进成果:生成器在五项竞赛级数学基准和四项跨领域推理任务中达到同类最佳性能,验证器则在ProcessBench数据集上领先。值得注意的是,双组件在最高难度数学推理问题上均表现出显著提升。代码见:https://github.com/kaiwenzha/rl-tango。
ChartCards: A Chart-Metadata Generation Framework for Multi-Task Chart Understanding
Abstract
arXiv:2505.15046v1 Announce Type: cross Abstract: The emergence of Multi-modal Large Language Models (MLLMs) presents new opportunities for chart understanding. However, due to the fine-grained nature of these tasks, applying MLLMs typically requires large, high-quality datasets for task-specific fine-tuning, leading to high data collection and training costs. To address this, we propose ChartCards, a unified chart-metadata generation framework for multi-task chart understanding. ChartCards systematically synthesizes various chart information, including data tables, visualization code, visual elements, and multi-dimensional semantic captions. By structuring this information into organized metadata, ChartCards enables a single chart to support multiple downstream tasks, such as text-to-chart retrieval, chart summarization, chart-to-table conversion, chart description, and chart question answering. Using ChartCards, we further construct MetaChart, a large-scale high-quality dataset containing 10,862 data tables, 85K charts, and 170K high-quality chart captions. We validate the dataset through qualitative crowdsourcing evaluations and quantitative fine-tuning experiments across various chart understanding tasks. Fine-tuning six different models on MetaChart resulted in an average performance improvement of 5% across all tasks. The most notable improvements are seen in text-to-chart retrieval and chart-to-table tasks, with Long-CLIP and Llama 3.2-11B achieving improvements of 17% and 28%, respectively.
摘要
多模态大语言模型(MLLMs)的出现为图表理解带来了新的机遇。然而,由于这类任务具有细粒度特性,应用MLLMs通常需要大规模高质量数据集进行任务特定微调,导致数据收集和训练成本高昂。为此,我们提出ChartCards——一个支持多任务图表理解的统一图表元数据生成框架。该框架系统化合成各类图表信息,包括数据表格、可视化代码、视觉元素以及多维度语义描述。通过将这些信息组织为结构化元数据,ChartCards使得单个图表可支持多种下游任务,例如文本到图表检索、图表摘要、图表转表格、图表描述和图表问答。基于ChartCards,我们进一步构建了MetaChart数据集,这个大规模高质量数据集包含10,862个数据表格、85K张图表和170K条优质图表描述。我们通过定性众包评估和跨多种图表理解任务的定量微调实验验证了数据集质量。在MetaChart上微调的六种不同模型,所有任务平均性能提升达5%。其中文本到图表检索和图表转表格任务提升最为显著,Long-CLIP和Llama 3.2-11B模型分别实现了17%和28%的性能提升。
PiFlow: Principle-aware Scientific Discovery with Multi-Agent Collaboration
Abstract
arXiv:2505.15047v1 Announce Type: cross Abstract: Large Language Model (LLM)-based multi-agent systems (MAS) demonstrate remarkable potential for scientific discovery. Existing approaches, however, often automate scientific discovery using predefined workflows that lack rationality constraints. This often leads to aimless hypothesizing and a failure to consistently link hypotheses with evidence, thereby hindering systematic uncertainty reduction. Overcoming these limitations fundamentally requires systematic uncertainty reduction. We introduce PiFlow, an information-theoretical framework, treating automated scientific discovery as a structured uncertainty reduction problem guided by principles (e.g., scientific laws). In evaluations across three distinct scientific domains (discovering nanomaterial structures, bio-molecules, and superconductor candidates with targeted properties), our method significantly improves discovery efficiency, reflected by a 73.55% increase in the Area Under the Curve (AUC) of property values versus exploration steps, and enhances solution quality by 94.06% compared to a vanilla agent system. Overall, PiFlow serves as a Plug-and-Play method, establishing a novel paradigm shift in highly efficient automated scientific discovery, paving the way for more robust and accelerated AI-driven research. Code is publicly available on GitHub at https://github.com/amair-lab/PiFlow.
摘要
基于大语言模型(LLM)的多智能体系统(MAS)在科学发现领域展现出显著潜力。然而,现有方法通常采用缺乏合理性约束的预定义工作流来实现科学发现自动化,这往往导致假设生成漫无目的,且无法持续将假设与证据相关联,从而阻碍系统性不确定性的降低。克服这些局限性的核心在于实现系统化的不确定性消减。我们提出PiFlow——一个信息理论框架,将自动化科学发现视为受科学定律等原则指导的结构化不确定性消减问题。在三个不同科学领域(具有目标特性的纳米材料结构发现、生物分子发现和超导体候选材料发现)的评估中,本方法显著提升了发现效率(属性值与探索步骤的曲线下面积AUC提升73.55%),并将解决方案质量较基线智能体系统提高94.06%。总体而言,PiFlow作为一种即插即用方法,建立了高效自动化科学发现的新范式,为更稳健、更快速的人工智能驱动研究铺平了道路。代码已公开于GitHub:https://github.com/amair-lab/PiFlow。
DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Abstract
arXiv:2505.15074v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks.
摘要
大型语言模型(LLMs)正通过基于人类反馈的强化学习(RLHF)日益与人类偏好对齐。在众多RLHF方法中,组相对策略优化(GRPO)因其简洁性和卓越性能备受关注,其显著特点是无需学习价值函数。然而,GRPO隐含假设了均衡的领域分布和跨组语义对齐的一致性——这种假设在现实数据集中几乎无法成立。当应用于多领域不平衡数据时,GRPO会过度优化主导领域,忽视弱势领域,导致泛化能力与公平性下降。我们提出领域感知自洽策略优化(DISCO),这是GRPO的原则性扩展,通过两项关键创新解决组间不平衡问题:基于领域频率的奖励缩放通过领域流行度重加权优化来抵消频率偏差;难度感知奖励缩放利用提示级自洽性识别并优先处理具有更高学习价值的不确定性提示。这两种策略共同促进了跨领域更公平有效的策略学习。在多个LLM和倾斜训练分布上的大量实验表明,DISCO显著提升泛化能力,在Qwen3模型上以5%优势超越现有GRPO变体,并在多领域对齐基准测试中创造了新的最优性能记录。
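The two scaling ideas above can be illustrated on top of a GRPO-style group-relative score. The abstract gives no formulas, so every formula below is an assumption chosen for illustration: inverse-frequency domain weights, self-consistency measured as the share of the majority answer within a prompt's sampled group, and a difficulty weight of `2 - self_consistency`. The scaling is applied after group normalization, because a constant factor applied to raw rewards would cancel in the normalization.

```python
import statistics
from collections import Counter

def group_relative_scores(rewards):
    """GRPO-style normalization: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(reward - mean) / std for reward in rewards]

def domain_weight(domain, domain_counts):
    """Assumed inverse-frequency weight; equals 1.0 when domains are balanced."""
    total = sum(domain_counts.values())
    return total / (len(domain_counts) * domain_counts[domain])

def self_consistency(answers):
    """Share of sampled answers that agree with the most common answer."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

def difficulty_weight(answers):
    """Assumed rule: less self-consistent (more uncertain) prompts weigh more."""
    return 2.0 - self_consistency(answers)

def scaled_scores(rewards, answers, domain, domain_counts):
    """Apply both weights to the group-relative scores of one prompt."""
    weight = domain_weight(domain, domain_counts) * difficulty_weight(answers)
    return [weight * score for score in group_relative_scores(rewards)]

# Hypothetical skewed training mix and one prompt from the rare domain.
domain_counts = {"math": 900, "law": 100}
rewards = [1.0, 0.0, 1.0, 0.0]
answers = ["yes", "no", "yes", "no"]
print(scaled_scores(rewards, answers, "law", domain_counts))
```

With these assumed weights, a prompt from the rare domain with split answers contributes a larger update than an equally rewarded prompt from the dominant domain with unanimous answers.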
Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects
Abstract
arXiv:2505.15088v1 Announce Type: cross Abstract: Command injection vulnerabilities are a significant security threat in dynamic languages like Python, particularly in widely used open-source projects where security issues can have extensive impact. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, such as testing, researchers have explored their potential for vulnerability analysis. This study evaluates the potential of LLMs, such as GPT-4, as an alternative approach to automated testing for vulnerability detection. In particular, LLMs have demonstrated advanced contextual understanding and adaptability, making them promising candidates for identifying nuanced security vulnerabilities within code. To evaluate this potential, we applied LLM-based analysis to six high-profile GitHub projects (Django, Flask, TensorFlow, Scikit-learn, PyTorch, and Langchain), each with over 50,000 stars and extensive adoption across software development and academic research. Our analysis assesses both the strengths and limitations of LLMs in detecting command injection vulnerabilities, evaluating factors such as detection accuracy, efficiency, and practical integration into development workflows. In addition, we provide a comparative analysis of different LLM tools to identify those most suitable for security applications. Our findings offer guidance for developers and security researchers on leveraging LLMs as innovative and automated approaches to enhance software security.
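Summary
Command injection vulnerabilities are a major security threat in dynamic languages such as Python, especially in widely used open-source projects, where such issues can have far-reaching impact. As large language models (LLMs) have proven effective in code-related tasks such as testing, researchers have begun to explore their potential for vulnerability analysis. This study evaluates the feasibility of LLMs such as GPT-4 as an alternative for automated vulnerability detection. LLMs show advanced contextual understanding and adaptability, making them strong candidates for identifying subtle security vulnerabilities in code. To test this potential, we ran LLM-based analysis on six well-known GitHub projects (Django, Flask, TensorFlow, Scikit-learn, PyTorch, and Langchain), each with more than 50,000 stars and broad use in software development and academic research. Our analysis assesses the strengths and limitations of LLMs in detecting command injection vulnerabilities, including detection accuracy, efficiency, and how well they integrate into development workflows. We also compare different LLM tools to identify those best suited to security applications. The findings give developers and security researchers practical guidance on using LLMs as an automated way to strengthen software security.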
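For readers unfamiliar with the vulnerability class studied above, the following minimal sketch contrasts an injectable call with a safer one. The function names are hypothetical, the example is not taken from any of the six projects, and the unsafe variant assumes a POSIX shell.

```python
import subprocess
import sys


def run_unsafe(name: str) -> str:
    # VULNERABLE: the whole string is handed to a shell, so shell
    # metacharacters in `name` (for example ';') start a second command.
    result = subprocess.run(f"echo {name}", shell=True,
                            capture_output=True, text=True)
    return result.stdout


def run_safe(name: str) -> str:
    # Safer: an argument list with no shell; `name` reaches the child
    # process as a single literal argument and is never parsed as code.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", name],
        capture_output=True, text=True)
    return result.stdout
```

Calling `run_unsafe("a; echo INJECTED")` executes the injected `echo`, while `run_safe` prints the payload back unchanged. When a shell is unavoidable, quoting untrusted input with `shlex.quote` is the usual mitigation.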
Self-GIVE: Associative Thinking from Limited Structured Knowledge for Enhanced Large Language Model Reasoning
Abstract
arXiv:2505.15062v1 Announce Type: cross Abstract: When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate "hormones helping mental disorders" with "melatonin being a hormone and insomnia a mental disorder" to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE's key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% → ... and 78.6% → ... on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.
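Summary
When tackling complex questions that require new information, people often relate the question to existing knowledge to derive a sensible answer. For example, when evaluating whether melatonin helps with insomnia, one might connect "hormones help with mental disorders" to "melatonin is a hormone and insomnia is a mental disorder" to complete the reasoning. Large language models (LLMs) need this kind of associative thinking too, especially for scientific queries where the retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) extrapolates structured knowledge through a knowledge graph (KG), but it requires constructing and pruning many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that uses reinforcement learning to give LLMs automatic associative thinking. The method extracts structured information and entity sets to help the model link to the queried concepts, and it addresses three key limitations of GIVE: (1) knowledge extrapolation requires many LLM calls and a large token overhead; (2) complex instructions make it hard to deploy on smaller LLMs (3B or 7B); and (3) LLM pruning yields inaccurate knowledge. Specifically, after fine-tuning with Self-GIVE on a 135-node UMLS KG, the Qwen2.5 3B and 7B models improve by up to 28.5% → ... and 78.6% → ... on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE lets the 7B model match or outperform GPT3.5 turbo with GIVE while cutting token usage by more than 90%. The method strengthens the scalable integration of structured retrieval with associative reasoning.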
StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Abstract
arXiv:2505.15107v1 Announce Type: cross Abstract: Efficient multi-hop reasoning requires Large Language Model (LLM) based agents to acquire high-value external knowledge iteratively. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but these approaches underperform on complex, multi-hop QA because they receive only sparse rewards from a global signal. To address this gap, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It provides richer and more detailed intermediate search rewards and token-level process supervision, based on information gain and redundancy penalties, to better guide each search step. Using a data pipeline built on open-source datasets, we constructed a fine-grained question-answering dataset containing sub-question-level search trajectories. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our implementation is publicly available at https://github.com/zxh20001117/StepSearch.
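Summary
Efficient multi-hop reasoning requires LLM-based agents to iteratively acquire high-value external knowledge. Prior work has used reinforcement learning (RL) to train LLMs for search-based document retrieval and achieved notable gains in QA performance, but these methods fall short on complex multi-hop QA when they rely only on a sparse global reward signal. To fill this gap, we propose StepSearch, a framework for search LLMs trained with step-wise proximal policy optimization. It provides richer intermediate search rewards based on information gain and redundancy penalties, together with fine-grained token-level process supervision, to better guide each search step. Using a data pipeline, we built a fine-grained QA dataset with sub-question-level search trajectories on top of open-source datasets. On standard multi-hop QA benchmarks, the method significantly outperforms global-reward baselines: with only 19k training samples, the 3B and 7B models achieve absolute improvements of 11.2% and 4.2%, respectively, over various search-with-RL baselines, demonstrating the effectiveness of fine-grained, step-wise supervision for optimizing deep search LLMs. The code is open source at https://github.com/zxh20001117/StepSearch.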
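To make the idea of a per-step reward built from information gain and a redundancy penalty more concrete, here is a minimal sketch. The Jaccard word-overlap similarity, the `penalty` coefficient, and all function names are assumptions chosen for illustration; StepSearch's actual reward definition may differ.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def step_reward(retrieved, gold_docs, best_so_far, seen, penalty=0.5):
    """Score one search step; returns (reward, updated best scores, updated seen set).

    Information gain: how much this step improves the best match found so
    far for each gold document. Redundancy: the share of retrieved
    documents that an earlier step already returned.
    """
    gain = 0.0
    new_best = list(best_so_far)
    for i, gold in enumerate(gold_docs):
        score = max((jaccard(doc, gold) for doc in retrieved), default=0.0)
        if score > new_best[i]:
            gain += score - new_best[i]
            new_best[i] = score
    gain /= max(len(gold_docs), 1)
    repeats = sum(1 for doc in retrieved if doc in seen)
    redundancy = repeats / len(retrieved) if retrieved else 0.0
    return gain - penalty * redundancy, new_best, seen | set(retrieved)
```

Under this sketch, a step that retrieves a new, relevant document earns a positive reward, while repeating the same query result in the next step earns a negative one, so every step receives feedback rather than only the final answer.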
ThinkRec: Thinking-based recommendation via LLM
Abstract
arXiv:2505.15091v1 Announce Type: cross Abstract: Recent advances in large language models (LLMs) have enabled more semantic-aware recommendations through natural language generation. Existing LLM for recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on superficial features to match similar items based on click history, rather than reasoning through deeper behavioral logic. This often leads to superficial and erroneous recommendations. Motivated by this, we propose ThinkRec, a thinking-based framework that shifts LLM4Rec from System 1 to System 2 (rational system). Technically, ThinkRec introduces a thinking activation mechanism that augments item metadata with keyword summarization and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains that consist of analyzing interaction histories, identifying user preferences, and making decisions based on target items. On top of this, we propose an instance-wise expert fusion mechanism to reduce the reasoning difficulty. By dynamically assigning weights to expert models based on users' latent features, ThinkRec adapts its reasoning path to individual users, thereby enhancing precision and personalization. Extensive experiments on real-world datasets demonstrate that ThinkRec significantly improves the accuracy and interpretability of recommendations. Our implementation is available on Anonymous GitHub: https://anonymous.4open.science/r/ThinkRec_LLM.
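Summary
Recent advances in large language models (LLMs) have enabled more semantically aware recommendations through natural language generation. Existing LLM-based recommendation (LLM4Rec) methods mostly operate in a System 1-like manner, relying on surface features to match similar items from click history instead of reasoning through deeper behavioral logic, which often makes recommendations superficial and error-prone. Motivated by this, we propose ThinkRec, a thinking-based framework that moves LLM4Rec from System 1 to System 2 (the rational system). Technically, ThinkRec introduces a thinking activation mechanism that enriches item metadata with keyword summaries and injects synthetic reasoning traces, guiding the model to form interpretable reasoning chains that analyze interaction histories, identify user preferences, and make decisions based on target items. On top of this, we propose an instance-wise expert fusion mechanism to reduce reasoning difficulty. By dynamically assigning weights to expert models according to users' latent features, ThinkRec adapts its reasoning path to each user, improving precision and personalization. Extensive experiments on real-world datasets show that ThinkRec significantly improves the accuracy and interpretability of recommendations. The implementation is available on Anonymous GitHub: https://anonymous.4open.science/r/ThinkRec_LLM.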
DeFTX: Denoised Sparse Fine-Tuning for Zero-Shot Cross-Lingual Transfer
Abstract
arXiv:2505.15090v1 Announce Type: cross Abstract: Effective cross-lingual transfer remains a critical challenge in scaling the benefits of large language models from high-resource to low-resource languages. Towards this goal, prior studies have explored many approaches to combine task knowledge from task-specific data in a (high-resource) source language and language knowledge from unlabeled text in a (low-resource) target language. One notable approach proposed composable sparse fine-tuning (SFT) for cross-lingual transfer, which learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters that are further fine-tuned. These sparse fine-tuned vectors (SFTs) are subsequently composed with the pretrained model to facilitate zero-shot cross-lingual transfer to a task in a target language, using only task-specific data from a source language. The sparse masks for SFTs were identified using simple magnitude-based pruning. In our work, we introduce DeFT-X, a novel composable SFT approach that uses singular value decomposition to denoise the weight matrices of a pretrained model before magnitude pruning, thus yielding more robust SFTs. We evaluate DeFT-X on a diverse set of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI) and demonstrate that it performs on par with or outperforms SFT and other prominent cross-lingual transfer baselines.
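Summary
Effective cross-lingual transfer remains a key challenge in extending the benefits of large language models from high-resource to low-resource languages. To this end, prior studies have explored many ways to combine the task knowledge in task-specific data from a (high-resource) source language with the language knowledge in unlabeled text from a (low-resource) target language. One notable approach is composable sparse fine-tuning (SFT), which learns task-specific and language-specific sparse masks to select a subset of the pretrained model's parameters for further fine-tuning. These sparse fine-tuned vectors (SFTs) are then composed with the pretrained model, enabling zero-shot cross-lingual transfer to a target-language task using only task-specific data from the source language. The original SFT sparse masks were chosen by simple magnitude-based pruning. In this work, we propose DeFT-X, a new composable SFT approach that denoises the pretrained model's weight matrices with singular value decomposition before magnitude pruning, producing more robust SFTs. We evaluate DeFT-X on a range of extremely low-resource languages for sentiment classification (NusaX) and natural language inference (AmericasNLI), and the results show that it performs on par with or better than SFT and other prominent cross-lingual transfer baselines.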
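The "denoise with SVD, then prune by magnitude" step described above can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions: which matrices are denoised, how the rank is chosen, and the function names `low_rank_denoise` and `magnitude_mask` are all hypothetical and not taken from DeFT-X.

```python
import numpy as np


def low_rank_denoise(matrix: np.ndarray, rank: int) -> np.ndarray:
    """Keep only the top-`rank` singular components of `matrix`."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def magnitude_mask(matrix: np.ndarray, keep: int) -> np.ndarray:
    """Boolean mask selecting the `keep` largest-magnitude entries."""
    flat = np.abs(matrix).ravel()
    mask = np.zeros(flat.shape, dtype=bool)
    if keep > 0:
        top = np.argpartition(flat, -keep)[-keep:]
        mask[top] = True
    return mask.reshape(matrix.shape)
```

A sparse mask would then be computed as `magnitude_mask(low_rank_denoise(w, rank), keep)`, so that small noisy entries removed by the truncated SVD are less likely to be selected for fine-tuning.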
SUS backprop: linear backpropagation algorithm for long inputs in transformers
Abstract
arXiv:2505.15080v1 Announce Type: cross Abstract: It is straightforward to design an unbiased gradient estimator that stochastically cuts the backpropagation flow through any part of a computational graph. In some situations, cutting the parts that have little effect on the computation can save a significant amount of backpropagation compute in exchange for a minimal increase in the stochastic gradient variance. One such situation occurs in the attention mechanism of the transformer architecture. For long sequences, attention becomes the limiting factor, as its compute requirements increase quadratically with sequence length. At the same time, most attention weights become very small, as most attention heads tend to connect a given token with only a small fraction of the other tokens in the sequence. These weights are promising targets for cutting backpropagation. We propose a simple probabilistic rule, controlled by a single parameter, that cuts backpropagation through most attention weights, leaving at most [...] interactions per token per attention head. This brings a factor of [...] reduction in the compute required for the attention backpropagation, turning it from quadratic to linear complexity. We have empirically verified that, for a typical transformer model, cutting [...] of the attention gradient flow (i.e. choosing [...]) results in a relative gradient variance increase of only about [...] for [...], and it decreases with [...]. This approach is amenable to efficient sparse matrix implementation, and is thus promising for making the cost of a backward pass negligible relative to the cost of a forward pass when training a transformer model on long sequences.
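The general idea of an unbiased estimator that stochastically drops small contributions can be sketched as below. The keep probability `min(1, c * a)`, the parameter name `c`, and the function names are assumptions made for this illustration; the abstract does not state the actual rule. Under this assumed rule, each row of attention weights sums to 1, so the expected number of kept entries per row is at most `c`.

```python
import numpy as np


def keep_probabilities(attn: np.ndarray, c: float) -> np.ndarray:
    """Assumed rule: keep entry (i, j) with probability min(1, c * attn[i, j])."""
    return np.minimum(1.0, c * attn)


def sparsify_unbiased(values: np.ndarray, attn: np.ndarray, c: float, rng):
    """Randomly drop entries of `values`, rescaling survivors by 1 / p.

    Each entry is kept with probability p and divided by p, so the
    expectation of the output equals `values` (an unbiased estimator).
    """
    p = keep_probabilities(attn, c)
    keep = rng.random(attn.shape) < p  # kept entries always have p > 0
    scaled = np.zeros(values.shape, dtype=float)
    scaled[keep] = values[keep] / p[keep]
    return scaled, keep
```

In a real backward pass, `values` would be the gradient terms flowing through the attention weights, and only the kept entries would need to be computed, which is where a sparse matrix implementation would save work.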